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Abstract 

This paper describes a system for evaluating pedestrian 
detection algorithm results. 

The developed tool allows a human operator to annotate 
on a file all pedestrians in a previously acquired video 
sequence. A similar file is produced by the algorithm being 
tested using the same annotation engine. A matching rule 
has been established to validate the association between 
items of the two files. For each frame a statistical analyzer 
extracts the number of mis-detections, both positive and 
negative, and correct detections. Using these data, statistics 
about the algorithm behavior are computed with the aim 
of tuning parameters and pointing out recongnition weak¬ 
nesses in particular situations. 

I. Introduction 

The detection of human shapes is one of the most active 
research objective in the field of artificial vision. Various 
approaches have recently been proposed (many applications 
rely on such detectors, like automotive precrash, security 
and surveillance systems) [1], [2], [3], [4], An important 
issue at the basis of the design of a human shape detector 
is the availability of a tool for performance evaluation. 
Working on real images, because of the intrinsic problem 
complexity, some kind of external information is necessary 
in order to validate the algorithm results. 

A system for performance evaluation needs to know the 
"ground truth”, this can be obtained using two different ap¬ 
proaches: recording additional data together with processed 
images data and using the annotation based approach, 
presented in this paper. The former collects information 
about the pedestrians position using sensors different from 
vision, such as radio transmitters. The correctness of the 
algorithm can be evaluated in realtime but some problems 
may occur in cases, for example, where a pedestrian is 
partially occluded but the radio transmitter (or other) is 
anyway sensed by the detector. The other approach, the 
one dealt with in this paper, relies on a frame by frame 
manual annotation, by a human operator, of all pedestrians 
appearing in each frame of a video sequence. This is a 
post processing operation, thus images must be acquired 
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and saved on a storage device. Subsequently, the images 
are annotated in laboratory: a human operator, using a 
GUI, defines the position and size of pedestrians in each 
frame and produces a file containing the description of 
all pedestrian in the image sequence. A similar file of the 
same format is created by the pedestrian detection algorithm 
which is under test. Finally the two files are compared 
and statistics are extracted. Parameters and thresholds can 
be adjusted and their effect on the algorithm behaviour 
highlighted. 

The outline of this paper is as follows. Section 2 briefly 
reviews the state of the art in performance evaluation tools 
for vision algorithms. Section 3 describes the annotation 
tool composed by: engine, GUI, and performance ana¬ 
lyzer. In section 4, an evaluation method for algorithms 
is proposed along with a case study. Finally, the paper is 
concluded with a discussion describing results about the 
optimization of the case study and future work. 

II. PERFORMANCE EVALUATION 

This section describes the state of the art in performance 
evaluation for vision algorithms and in particular for pedes¬ 
trian detection. 

Vision applications proved their efficiency and usefulness 
in many fields but current research practices, and in particu¬ 
lar system-building techniques, are inadequate especially for 
fine tuning and filter combination testing. One key aspect 
of this problem is the inability to conduct adequate perfor¬ 
mance characterization of new technologies (like pedestrian 
detectors). Reasons of this fact are due to the complexity 
of real scenes, sometimes pedestrians are occluded by other 
obstacles, sometimes parts of framed obstacles looks like 
pedestrians (even for humans observers). 

The main purpose of a general approach to Performance 
Characterization of Computer Vision Systems [5] is the 
statistical testing, tuning, algorithmic combination and al¬ 
gorithmic re-use in order to improve algorithms reliability 
and robustness. 

The work presented in [6] uses ground truth automat¬ 
ically extracted from pseudo synthetic video to perform 
evaluation of a pedestrian tracker on typical surveillance 
images taken from a fixed camera. However when dealing 
with real images taken from moving a cameras installed 
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on a vehicle, the manual ground-tmth generation approach 
seems to be more robust. 

ViPER (Video Performance Evaluation Resource) de¬ 
scribed in [7] and [8] is a Java integrated tool for authoring 
ground truth meta-data in image sequences and evaluate 
performance of algorithms. 

A similar system is proposed in this paper. A key advan¬ 
tage of this tool is it’s integration in a complete environment 
for the development of vision algorithms. This simplify 
design and tuning of parameters allowing to directly check 
the impact, of their variation, on performances. 

The study presented in [3] points out the importance of a 
good performance evaluation method for the actual deploy¬ 
ment of systems on board of vehicles. This study was based 
on a large number of tests. Results were analyzed using 
ROC curves to highlight the impact of parameters variations 
on the algorithm performance in terms of detection rate and 
false positive rate. 

III. THE ANNOTATION TOOL 

A. Description 

Performance evaluation using the annotation tool takes 
place in three steps: supervised sequence annotation by 
a human operator, automatic sequence annotation by the 
algorithm being tested, annotations comparison and anal¬ 
ysis. Thanks to a common annotation engine, the tool 
allows the human operator and the algorithm to extract, into 
separate files, pedestrian information relative to the same 
image sequence. Pedestrians are described by means of the 
bounding boxes (BBs) framing their shape. The two files are 
compared and statistical information about the algorithm 
behavior are extracted. Each step is described below in 
detail. The performance evaluation tool structure is shown 
in figure 1. 

1) Supervised sequence annotation by a human operator: 
During this step a human operator analyzes every frame 
of a pre-recorded video sequence. For each frame in the 
sequence the operator manually draws the BB around the 
pedestrian using a graphical user interface described in 
section II1-B. For each BB, the operator also locates the 
region containing the pedestrian’s head. The head is an 
important human shape feature that can easily be found. The 
existence of informations describing heads allows to profile 
recognition performances of the algorithms that looks for 
this feature. 

It is also possible to classify the pedestrian as completely 
visible or partially occluded by an obstacle. This description 
of the sequence ground truth is stored in a file named H 
shown in figure 1. An example reporting a frame during 
the annotation process is reported in figure 2.a. 

2) Automatic sequence annotation by the algorithm un¬ 
der test: The output of a pedestrian detection algorithm 
can be described in terms of a list of BBs for each frame. 
Optionally, the algorithm can also produce information 
regarding the position of the head or some other interesting 
feature related to the pedestrian. 


If needed an additional block can be added to the algo¬ 
rithm output stage in order to translate its results in a format 
compatible with the annotation engine input. For example 
if an algorithm extracts the human shapes from source 
images its output can be converted in a list of BBs each 
one defining the pixel area in the source image occupied 
by the pedestrian’s shape. The description of the sequence 
produced by the algorithm is saved in a file named A also 
shown in figure 1. An example of BB generated by the 
algorithm under test is reported in figure 2.b. 

3) Annotations Comparison and Analysis: This is the 
last step of the performance evaluation process. It takes as 
input the two previously created files and compares them 
frame by frame, extracting statistical information about the 
algorithm behavior. Three values are calculated for each 
frame: false positives (FP), false negatives (FN), and correct 
detections (CD). 

In order to distinguish if a BB generated by the algorithm 
represents a correct detection (CD) it is necessary to match 
it to all the BBs annotated by the human operator. 

Two BBs, p and q, of area Z v and Z q respectively, are 
defined as matching if Z pq = W pq /Z p Z q is greater than 
Z T h, were W pq is the overlapped area between p and q and 
Z T h is a threshold adjustable by the user (a good value may 
be Z T h = 0.7). 

This relation embodies the following property: well over¬ 
lapped BBs generate high values of Z pq , but as the over¬ 
lapping area decreases linearly, the value of Z vq decreases 
at higher rate (square). 

Every frame of the sequence analyzed through a 
particular algorithm can be modeled with the following 
two sets: 

H n = {BBs annotated by a human operator} 

A n = {BBs annotated by the examined algorithm} 

where n is the frame number of a specific video 
sequence N frames long. 

Let the symbol |7f| represents the cardinality of X, 
namely the number of elements in the set. 

It is possible to define the matching operator 0 between 
a BB annotated by the operator and one annotated by the 
algorithm under testing in this way: 

let a z C A n and hj € H n \ 

, ^ I 1 Zmhj = max{Z n , ihk \Z ai h k > Z T h] 

j ~ \ o otherwise 

Based on this definition Va, € A„ exists at most one 
j such that a, 0 hj = 1. In fact, given a t , Z ai kj > Z t h 
for different values of j. This ambiguity is resolved by the 
maxi) function. 

A graphic representation of the matching process result 
is reported in figure 2.c. 
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Fig. 1. Block diagram of the algorithm: human and algorithm process the same video sequence producing annotation files H and A. The two files are 
compared producing statistics about algorithm performance 
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Fig. 2. (a) Annotation window during the human supervised annotation process: the currently selected BB is cyan filled (it can be resized or moved) 

non-filled green BBs are non selected BB, the yellow one is marked as occluded, the green-filled one is currently being drawn by the operator. Each 
BB is composed of two rectangles framing the pedestrian shape and head. ( b ) BBs generated by the algorithm. The red numbers indicate the distance 
of the pedestrian, the violet area represents the 3D space where stereo vision can be applied, (c) Matching phase result: BBs found by the pedestrian 
detector are presented in red, annotated BBs are in blue and yellow (if occluded), matched BBs are in green. 


Using these definitions, three different values are defined: 

I | | H rJ | 

C°n = a i® h i 

i = 0 j = i 

FPn = \A n \ - CD n 

FN n = \H n \ — CD n 

These values represent respectively the number of correct 
detections, false positives, and false negatives for the n-th 
frame of the sequence. In this way it is possible to identify 
frames in which the algorithm works fine and situations 
related to algorithm weaknesses. 

Now a set of global values is defined referred to the whole 
sequence in order to compare different algorithms working 
on that sequence. 

Af-l 

CD n 

CDR = n= .° —— (1) 

\**n I 

N~ 1 

FPf? = — - (2) 


These values are sequence specific and measure respec¬ 
tively: the correct detection rate and the false positive rate. 
FPR cannot be normalized because false positives have no 
upper limit. 

B. User Interface 

The input interface has been designed with the objective 
of reducing, as much as possible, the workload for frame 
annotation. An example of the annotation window during 
the drawing process of a new pedestrian is presented in 
figures 2.a. In figure 6.b is reported the annotation panel 
during the annotation process. The key points for reducing 
the annotation time are the following: 

Similarity between consecutive frames. Usually a frame 
in a real-time sequence contains little differences from the 
previous one. For this reason it can be assumed that a 
BB containing a pedestrian will have a similar position 
and size in the subsequent frame, possibly with some little 
corrections. To remove the need for a complete redrawing 
of BBs on every frame of a sequence, the GUI copies all 
BBs of a certain frame to the following one, leaving to the 
operator the task of adjusting size and position, as well as 
adding and deleting new and disappeared BBs. 

Easy input method. The interface has been studied keep¬ 
ing ergonomics in mind: the operator uses one hand to 

















1 


CDR 


0 

Fig. 3. Vision based pedestrian detection normalized evaluation space. 

command the mouse and the other one for the keyboard. In 
this way all important commands such as tracing, resizing 
and repositioning of BBs are directly available to the user. 
Moreover using the mouse wheel the operator can select the 
target BB to modify. Some additional keyboard commands 
allow to speed up common operations such as deleting all 
BBs in the frame. 

It has been proven that this kind of interface is user 
friendly. This GUI allows a human operator to annotate 
about 100 frames/h. This number is the average speed 
obtained from 10 different users who never used the tool 
before, each annotating 200 frames. However it is reason¬ 
able to assume this speed will increases with experience. A 
more accurate drawing of BBs can be obtained magnifying 
the tracing area. 

IV. ALGORITHMS EVALUATION SPACE 

This section contains some considerations regarding al¬ 
gorithms evaluation. 

The values CDR and FPR introduced in the previous 
section were referred to a single sequence. Indeed in order 
to have a more general and robust statistical description 
of the algorithm these values must be computed on a 
sufficiently large sequence including a wide variety of 
different scenarios. 

The 2D space < FPR, CDR > is defined as shown in 
figure 3; the optimal algorithm is placed in the point (0,1). 
Namely, all algorithms that do not give any false positive 
should stay in the segment [0, a ] with a € [0,1], Algorithms 
with bad performance fall in the right bottom part of the 
space while real cases fall inside the central area. 

This kind of evaluation allows to determine if an algo¬ 
rithm modification improves (even slightly) the recognition 
performance. For example this system is useful to fine tune 
parameters and thresholds. It is possible to evaluate the 
impact that a parameter modification has on the algorithm 
performance observing the movement of the point repre¬ 
senting the algorithm in the evaluation space. Moreover the 
inclusion of new filters can be evaluated measuring per¬ 
formance variations. Indeed the time consuming annotation 




Fig. 5. Statistics extraction screen-shot: for each frame values of CD, FP, 
FN and human annotated (in yellow) are computed. 


process that requires the human supervision is performed 
only once for every sequence. 

It is also possible to define a metric (for example the 
euclidean distance from the [0,1] point) to asses improve¬ 
ments. It is necessary to underline that the specific opti¬ 
mality criterion is strictly dependent on the application. In 
some applications, such as quality control for example, it 
may be desirable to avoid false positives and disregard false 
negatives. In other applications, such as automotive precrash 
systems, some false positives may be acceptable, even if an 
excessive number of FP reduces the user confidence in the 
recognition system. In these two cases the distance from 
the optimum point should be defined in different ways. 

v. Preliminary results and Conclusions 

The performance evaluation tool described in this paper 
has been used to evaluate pedestrian detection algorithms. In 
particular, the algorithm presented in [9] has been chosen as 
a case study. A number of sequences for nearly 1500 images 
have been manually annotated. The sequences were taken in 
different scenarios (parking lot, open field, and downtown), 
under different illumination and weather conditions, and 
framed different subjects at different distances. The aim 
of this step was to create a test set describing many of 
the the possible cases that the algorithm can deal with. In 
the test sequence, composed of 1500 images, 1897 human 
shapes were annotated as completely visible while 361 were 
marked as partially occluded. Figure 4 shows a number of 
different situations for the test sequence. 

The overall system performance shows that the correct 
detection rate is about 83% (1572/1897=82.9%). The high 
sensitivity of the algorithm (83%) comes along with an 
appreciable false positives rate 0.46 FP per frame. Indeed, 
a set of higher thresholds in the algorithm would decrease 
the number of false positives, but, at the same time would 
reduce the correct detection rate. The false negatives rate 
(426/1897=26.1%) summed up with the correct detection 
rate exceeds 100% (82.9%+26.1%=109%) since the latter 
also includes occasional detections of occluded human 
shapes. 

The main result of these tests were the statistical charac¬ 
terization of the detector behaviour and the precise iden¬ 
tification of particularly challenging segments of a large 
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Fig. 4. Examples of situations in which the human shape detection algorithms partially fails: (a) background noise generated by parked vehicles 
introduces false positives; ( b ) columns generate false positives due to their symmetry; (c) two pedestrians walking side by side mislead symmetry 
evaluation. 
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Fig. 6. (a) Statistics extraction screen-shot: for each frame values of CD, 
FP, FN are computed and cumulated up to the current frame. ( b ) Annotation 
panel during the human supervised annotation process: information about 
current operation are displayed 


sequence. The program output while extracting statistics 
from the algorithm is shown in figures 5 and 6.b. 

The presented performance evaluation tool has been 
proven to be effective though it requires a very expensive 
annotation process. The time required to annotate one frame 
is a value than can be reduced only by either modifying the 
input method (finding more efficient shortcuts for frequent 
operations) or trying to detect off-line the BB modifications. 
Thus, improvements should be in GUI refining and in 
automatic resize/reposition of the bounding boxes using 
motion detection techniques whenever possible. GUI re¬ 
finement can be done following impressions of user that 
performs longs annotations. Motion detection techniques 
such as correlation analysis between frames and optical 
flow can be used to determine the new coarse position 
and size of BBs in new frame starting from those in the 
previous frames. It is necessary to consider that the time 
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spent to perform such detection can’t be too hight in order 
to maintain the number of annotated frames/h comparable 
with the human one. Relying on an accurate prediction of 
the new BBs position, operator’s task would be reduced 
to a mere supervision activity reducing the time spent to 
annotate each single frame. 

This study shows a tool to detect positive aspects and 
weakness points of a pedestrian detector working on a given 
video sequence. The tool can also serve as a performance 
comparison method between different algorithms. 

The system was implemented using the C language and 
included in the GOLD software, a framework for vision 
applications development implemented at the University of 
Parma. 
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